OCR Correction and Query Expansion for Retrieval on OCR Data -- CLARIT TREC-5 Confusion Track Report
Similar resources
Report on the TREC-5 Confusion Track
For TREC-5, retrieval from corrupted data was studied through retrieval of single target documents from a corpus which was corrupted by producing page images, corrupting the bit maps, and applying OCR techniques to the results. In general, methods which attempted a probabilistic estimation of the original clean text fare better than methods which simply accept corrupted versions of the query text.
Revisiting Known-Item Retrieval in Degraded Document Collections
Optical character recognition software converts an image of text to a text document but typically degrades the document’s contents. Correcting such degradation to enable the document set to be queried effectively is the focus of this work. The described approach uses a fusion of substring generation rules and context aware analysis to correct these errors. Evaluation was facilitated by two publ...
The TREC-6 Spoken Document Retrieval Track
The Text REtrieval Conference (TREC) workshops provide a forum for different groups to compare retrieval systems on common retrieval tasks. The 1997 TREC workshop will feature a Spoken Document Retrieval task for the first time. This paper motivates the task and describes the measures to be used to evaluate the effectiveness of the retrieval methodologies. 1. The Text REtrieval Conference The Text ...
A Content-based Probabilistic Correction Model for OCR Document Retrieval
The difficulty with information retrieval for OCR documents lies in the fact that OCR documents contain a significant number of erroneous words and, unfortunately, most information retrieval techniques rely heavily on word matching between documents and queries. In this paper, we propose a general content-based correction model that can work on top of an existing OCR correction tool to “boost...
RMIT University at TREC 2008: Legal Track
This paper reports on the participation of RMIT University in the 2008 TREC Legal Track Ad Hoc task. OCR errors can corrupt the document view formed by an information retrieval system, and substantially hinder the successful retrieval of relevant documents for user queries. In previous research, the presence of errors in OCR text was observed to lead to unstable and unpredictable retrieval effe...